Grounded Reinforcement Learning: Learning to Win the Game under Human Commands Supplementary Materials

Neural Information Processing Systems

In this section, we describe the details of the MiniRTS environment and the human dataset. The data do not contain any personally identifiable information or offensive content. Figure 1: MiniRTS [2] implements a rock-paper-scissors attack graph; each army type has some units it is effective against and some it is vulnerable to. "swordman", "spearman" and "cavalry" are all effective against "archer". Figure 2: Building units can produce different army units using resources. Resource Units: Resource units are stationary and neutral.
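The counter structure quoted above lends itself to a simple lookup table. Below is a minimal, hypothetical Python sketch encoding only the relations stated in this snippet; the full attack graph is given in Figure 1 of the paper, and the dictionary and helper names here are illustrative, not from the source.

```python
# Hypothetical partial encoding of the MiniRTS counter relations quoted above.
# Only the edges stated in the snippet are included; Figure 1 of the paper
# defines the complete rock-paper-scissors attack graph.
EFFECTIVE_AGAINST = {
    "swordman": {"archer"},
    "spearman": {"archer"},
    "cavalry": {"archer"},
}

def is_effective(attacker: str, defender: str) -> bool:
    """Return True if `attacker` counters `defender` under the partial graph above."""
    return defender in EFFECTIVE_AGAINST.get(attacker, set())

assert is_effective("cavalry", "archer")
assert not is_effective("archer", "cavalry")
```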


See, Think, Act: Online Shopper Behavior Simulation with VLM Agents

Zhang, Yimeng, Gesi, Jiri, Xue, Ran, Wang, Tian, Wang, Ziyi, Lu, Yuxuan, Zhan, Sinong, Zeng, Huimin, Cui, Qingjun, Guo, Yufan, Huang, Jing, Shah, Mubarak, Wang, Dakuo

arXiv.org Artificial Intelligence

LLMs have recently demonstrated strong potential in simulating online shopper behavior. Prior work has improved action prediction by applying SFT on action traces with LLM-generated rationales, and by leveraging RL to further enhance reasoning capabilities. Despite these advances, current approaches rely on text-based inputs and overlook the essential role of visual perception in shaping human decision-making during web GUI interactions. In this paper, we investigate the integration of visual information, specifically webpage screenshots, into behavior simulation via VLMs, leveraging the OPeRA dataset. By grounding agent decision-making in both textual and visual modalities, we aim to narrow the gap between synthetic agents and real-world users, thereby enabling more cognitively aligned simulations of online shopping behavior. Specifically, we employ SFT for joint action prediction and rationale generation, conditioning on the full interaction context, which comprises action history, past HTML observations, and the current webpage screenshot. To further enhance reasoning capabilities, we integrate RL with a hierarchical reward structure, scaled by a difficulty-aware factor that prioritizes challenging decision points. Empirically, our studies show that incorporating visual grounding yields substantial gains: the combination of text and image inputs improves exact match accuracy by more than 6% over text-only inputs. These results indicate that multi-modal grounding not only boosts predictive accuracy but also enhances simulation fidelity in visually complex environments, capturing nuances of human attention and decision-making that text-only agents often miss. Finally, we revisit the design space of behavior simulation frameworks, identify key methodological limitations, and propose future research directions toward building efficient and effective human behavior simulators.
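As a concrete illustration of the input construction described above, here is a minimal, hypothetical sketch of how the full interaction context (action history, past HTML observations, current screenshot) might be packed into a single VLM training example. All field names and the message format are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str   # e.g. 'click(add_to_cart)' -- illustrative action string
    html: str     # HTML observation before the action

def build_example(history: list[Step], screenshot_path: str,
                  rationale: str, next_action: str) -> dict:
    """Pack the interaction context into one SFT example (hypothetical format)."""
    context = "\n".join(
        f"[t-{len(history) - i}] action={s.action}\nhtml={s.html[:2000]}"
        for i, s in enumerate(history)
    )
    return {
        "image": screenshot_path,  # current webpage screenshot
        "prompt": f"Interaction history:\n{context}\nPredict the next action.",
        "target": f"Rationale: {rationale}\nAction: {next_action}",  # joint targets
    }
```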


WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation

Qian, Yaoyao, Wang, Yuanli, Zhang, Jinda, Zong, Yun, Chen, Meixu, Zhou, Hanhan, Huang, Jindan, Zeng, Yifan, Hu, Xinyu, Song, Chan Hee, Zhang, Danqing

arXiv.org Artificial Intelligence

Current evaluation of web agents largely reduces to binary success metrics or conformity to a single reference trajectory, ignoring the structural diversity present in benchmark datasets. We present WebGraphEval, a framework that abstracts trajectories from multiple agents into a unified, weighted action graph. This representation is directly compatible with benchmarks such as WebArena, leveraging leaderboard runs and newly collected trajectories without modifying environments. The framework canonically encodes actions, merges recurring behaviors, and applies structural analyses including reward propagation and success-weighted edge statistics. Evaluations across thousands of trajectories from six web agents show that the graph abstraction captures cross-model regularities, highlights redundancy and inefficiency, and identifies critical decision points overlooked by outcome-based metrics. By framing web interaction as graph-structured data, WebGraphEval establishes a general methodology for multi-path, cross-agent, and efficiency-aware evaluation of web agents.
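The framework is described in prose only; the following Python sketch shows one plausible way to merge canonically encoded trajectories into a unified weighted action graph with success-weighted edge statistics. The canonicalization function and edge attributes are assumptions, not WebGraphEval's actual implementation.

```python
from collections import defaultdict

def canonical(action: str) -> str:
    # Placeholder canonical encoding; the real framework would normalize
    # action arguments (URLs, element ids, text) before merging.
    return action.strip().lower()

def build_action_graph(trajectories: list[tuple[list[str], bool]]) -> dict:
    """Merge trajectories (action list, success flag) into weighted edges."""
    edges = defaultdict(lambda: {"count": 0, "success": 0})
    for actions, success in trajectories:
        nodes = ["START"] + [canonical(a) for a in actions] + ["END"]
        for u, v in zip(nodes, nodes[1:]):
            edges[(u, v)]["count"] += 1      # recurring behaviors merge here
            edges[(u, v)]["success"] += int(success)
    # Success-weighted edge statistic: fraction of traversals ending in success.
    return {e: s["success"] / s["count"] for e, s in edges.items()}
```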


VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos

Lu, Dunjie, Xu, Yiheng, Wang, Junli, Wu, Haoyuan, Wang, Xinyuan, Wang, Zekun, Yang, Junlin, Su, Hongjin, Chen, Jixuan, Chen, Junda, Mao, Yuchen, Zhou, Jingren, Lin, Junyang, Hui, Binyuan, Yu, Tao

arXiv.org Artificial Intelligence

Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
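To make the two-stage Video2Action design concrete, here is a hypothetical pipeline skeleton. The class names, method signatures, and the ActionSpan/ActionRecord types are illustrative stand-ins for the paper's video grounding model and action-content recognizer.

```python
from dataclasses import dataclass

@dataclass
class ActionSpan:
    start_s: float   # temporal boundary (seconds)
    end_s: float

@dataclass
class ActionRecord:
    kind: str        # e.g. 'click', 'type'
    params: dict     # e.g. {'x': 412, 'y': 96} or {'text': 'hello'}
    span: ActionSpan

class VideoGrounder:
    def detect(self, video_path: str) -> list[ActionSpan]:
        """Stage 1: localize GUI actions with temporal boundaries (stub)."""
        raise NotImplementedError

class ActionRecognizer:
    def parse(self, video_path: str, span: ActionSpan) -> ActionRecord:
        """Stage 2: extract structured parameters for one detected action (stub)."""
        raise NotImplementedError

def video2action(video_path: str, grounder: VideoGrounder,
                 recognizer: ActionRecognizer) -> list[ActionRecord]:
    """Inverse dynamics: unlabeled screen recording -> labeled interaction steps."""
    return [recognizer.parse(video_path, s) for s in grounder.detect(video_path)]
```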


Large-scale User Game Lifecycle Representation Learning

Gou, Yanjie, Liu, Jiangming, Xue, Kouying, Hu, Yi

arXiv.org Artificial Intelligence

Existing representation learning methods crafted for handling billions of items in recommendation systems are, however, unsuitable for game advertising and recommendation. This is primarily due to game sparsity, where the mere hundreds of games fall short for large-scale user representation learning, and game imbalance, where user behaviors are overwhelmingly dominated by a handful of popular games. To address the sparsity issue, we introduce the User Game Lifecycle (UGL), designed to enrich user behaviors in games. Additionally, we propose two innovative strategies aimed at manipulating user behaviors to more effectively extract both short- and long-term interests. To tackle the game imbalance challenge, we present an Inverse Probability Masking strategy for UGL representation learning. The offline and online experimental results demonstrate that the UGL representations significantly enhance the models, achieving an average offline AUC increase of 1.83% and an average online CVR increase of 21.67% for game advertising, and an offline AUC increase of 0.5% and an online ARPU increase of 0.82% for in-game item recommendation.
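The abstract does not spell out Inverse Probability Masking. One plausible reading, sketched below in Python, is to mask each game in a user's behavior sequence with probability proportional to the inverse of that game's empirical frequency, so rare games receive more prediction signal and the handful of dominant games do not swamp training. Every detail here is an assumption; the paper's actual scheme may differ.

```python
import random
from collections import Counter

def inverse_probability_mask(seq: list[str], counts: Counter,
                             base_rate: float = 0.15,
                             mask_token: str = "[MASK]") -> list[str]:
    """Mask games with probability ~ 1/frequency (hypothetical scheme).

    Assumes every game in `seq` appears in `counts`. The rarest game is
    masked at `base_rate`; popular games are masked proportionally less.
    """
    inv = {g: 1.0 / counts[g] for g in counts}   # rare games -> larger inverse
    max_inv = max(inv.values())
    return [mask_token if random.random() < base_rate * inv[g] / max_inv else g
            for g in seq]
```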


Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping

Wang, Ziyi, Lu, Yuxuan, Zhang, Yimeng, Huang, Jing, Wang, Dakuo

arXiv.org Artificial Intelligence

Simulating step-wise human behavior with Large Language Models (LLMs) has become an emerging research direction, enabling applications in various practical domains. While prior methods, including prompting, supervised fine-tuning (SFT), and reinforcement learning (RL), have shown promise in modeling step-wise behavior, they primarily learn a population-level policy without conditioning on a user's persona, yielding generic rather than personalized simulations. In this work, we pose a critical question: how can LLM agents better simulate personalized user behavior? We introduce Customer-R1, an RL-based method for personalized, step-wise user behavior simulation in online shopping environments. Our policy is conditioned on an explicit persona, and we optimize next-step rationale and action generation via action correctness reward signals. Experiments on the OPeRA dataset demonstrate that Customer-R1 not only significantly outperforms prompting and SFT-based baselines in next-action prediction tasks, but also better matches users' action distribution, indicating higher fidelity in personalized behavior simulation.
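A minimal sketch of the training signal described above, assuming a simple exact-match notion of action correctness; the prompt template and reward values are illustrative, not Customer-R1's actual design.

```python
def build_prompt(persona: str, context: str) -> str:
    """Condition the policy explicitly on the user's persona (hypothetical template)."""
    return (f"You are simulating this shopper: {persona}\n"
            f"Session so far:\n{context}\n"
            "Give your rationale, then the next action.")

def action_correctness_reward(predicted_action: str, gold_action: str) -> float:
    """Binary correctness signal on the generated next action."""
    return 1.0 if predicted_action.strip() == gold_action.strip() else 0.0
```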



Grounded Reinforcement Learning: Learning to Win the Game under Human Commands Supplementary Materials

Neural Information Processing Systems

In this section, we describe the details of the MiniRTS environment and the human dataset. … "spearman" but is restrained by "cavalry". "swordman", "spearman" and "cavalry" all are effective against "archer". Figure 2: Building units can produce different army units using resources. "workshop" can produce "archer", "dragon" and "catapult" while other … Resource Units: Resource units are stationary and neutral. Resource units cannot be constructed by anyone and are created at the beginning of a game. Building Units: MiniRTS supports 6 different building unit types.
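Using only the facts stated in this snippet, a minimal hypothetical encoding of the unit taxonomy might look like the following; the remaining five building types and their production lists are deliberately omitted because the snippet is truncated.

```python
# Partial MiniRTS unit taxonomy, encoding only what the snippet states.
PRODUCES = {
    "workshop": ["archer", "dragon", "catapult"],
    # ... snippet truncated; the other 5 building types are not listed here.
}

RESOURCE_UNIT_PROPERTIES = {
    "stationary": True,      # resource units do not move
    "neutral": True,         # owned by neither player
    "constructible": False,  # created only at game start, never built
}
```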


Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning

Zhang, Yimeng, Wang, Tian, Gesi, Jiri, Wang, Ziyi, Lu, Yuxuan, Lin, Jiacheng, Zhan, Sinong, Gao, Vianne, Jiao, Ruochen, Liu, Junze, Qian, Kun, Tang, Yuxin, Xue, Ran, Zhang, Houyu, Cui, Qingjun, Guo, Yufan, Wang, Dakuo

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have recently demonstrated strong potential in generating 'believable human-like' behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability, which in turn can improve downstream action prediction. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales. In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulation of real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment. This design evaluates both high-level action types and the correctness of fine-grained sub-action details (attributes and values), rewarding outputs proportionally to their difficulty. Experimental results show that our method achieves a relative improvement of over 65% compared to the baseline.
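To illustrate the hierarchical reward described above, here is a hypothetical sketch: coarse credit for the high-level action type plus finer-grained credit for sub-action attributes and values, scaled by a difficulty factor. The weights and the difficulty input are assumptions, not Shop-R1's actual constants.

```python
def hierarchical_reward(pred: dict, gold: dict, difficulty: float) -> float:
    """Coarse action-type credit + fine-grained sub-action credit, scaled so
    harder decision points earn proportionally more (hypothetical weights)."""
    r = 0.0
    if pred.get("type") == gold.get("type"):
        r += 0.5                                   # high-level action type
        attrs = gold.get("attributes", {})
        if attrs:
            matched = sum(pred.get("attributes", {}).get(k) == v
                          for k, v in attrs.items())
            r += 0.5 * matched / len(attrs)        # attributes and values
    return r * difficulty                          # difficulty-aware scaling

# Example: correct action type but wrong target element.
pred = {"type": "click", "attributes": {"element_id": "btn-buy"}}
gold = {"type": "click", "attributes": {"element_id": "btn-cart"}}
print(hierarchical_reward(pred, gold, difficulty=1.3))  # 0.5 * 1.3 = 0.65
```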